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Abstract 

Background: Decreasing costs of DNA sequencing have made prokaryotic draft genome sequences increasingly 
common. A contig scaffold is an ordering of contigs in the correct orientation. A scaffold can help genome 
comparisons and guide gap closure efforts. One popular technique for obtaining contig scaffolds is to map contigs 
onto a reference genome. However, rearrangements that may exist between the query and reference genomes may 
result in incorrect scaffolds, if these rearrangements are not taken into account. Large-scale inversions are common 
rearrangement events in prokaryotic genomes. Even in draft genomes it is possible to detect the presence of 
inversions given sufficient sequencing coverage and a sufficiently close reference genome. 

Results: We present a linear-time algorithm that can generate a set of contig scaffolds for a draft genome sequence 
represented in contigs given a reference genome. The algorithm is aimed at prokaryotic genomes and relies on the 
presence of matching sequence patterns between the query and reference genomes that can be interpreted as the 
result of large-scale inversions; we call these patterns inversion signatures. Our algorithm is capable of correctly 
generating a scaffold if at least one member of every inversion signature pair is present in contigs and no inversion 
signatures have been overwritten in evolution. The algorithm is also capable of generating scaffolds in the presence 
of any kind of inversion, even though in this general case there is no guarantee that all scaffolds in the scaffold set will 
be correct. We compare the performance of sis, the program that implements the algorithm, to seven other 
scaffold-generating programs. The results of our tests show that sis has overall better performance. 

Conclusions: sis is a new easy-to-use tool to generate contig scaffolds, available both as stand-alone and as a web 
server. The good performance of sis in our tests adds evidence that large-scale inversions are widespread in 
prokaryotic genomes. 
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Background 

With the decreasing costs of DNA sequencing it is now 
very common for prokaryotic genomes to be sequenced 
at "draft" status only. This means that the generated 
sequence will be a set of contigs (a contig is a substring 
of the string over the DNA alphabet that represents the 
genome sequence). The number of contigs depends on the 
sequencing fold coverage and DNA sequencing technol- 
ogy, and typically varies between half a dozen to a few 
hundred. 
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As of December 9, 2011, the number of draft micro- 
bial genome sequences in GenBank a is 2324, compared to 
1814 complete sequences. This difference is growing (in 
favor of draft sequences) over time (on March 15, 2011 
the numbers were 1821 and 1485, respectively). There- 
fore there is an increasing need for tools that can improve 
the sequencing and assembly results beyond a simple 
contig set. 

One technique that can be used to improve automated 
assembly results is to generate contig scaffolds from the 
contig set. A scaffold is an ordered set of contigs, with 
the desired order being the correct genome order, and 
with each contig in the correct orientation. Scaffolds help 
in whole genome comparisons and they can guide the 
finishing process, showing where the gaps are. 
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Some references in the literature adopt the term scaf- 
fold only for the situation where the contig ordering and 
orientation is given by paired-end read information [1]. 
Here we adopt a more general meaning, assuming a scaf- 
fold is a contig ordering obtained by any technique and/or 
additional data, or combination thereof. 

There are various techniques that can help to create a 
scaffold. In addition to paired-end reads, one can use data 
from physical [2] or optical maps [3,4]. Information about 
the order of specific contigs sometimes can be obtained 
by searching for genes that have been split by the gaps 
between those contigs. 

Another technique, which became popular in the last 
few years, is to use a reference genome. In this case 
we assume the query (draft) genome A has a close phy- 
logenetic relative B that has been fully sequenced, and 
the genome of B can be used to guide the assembly of 
A and to generate a scaffold as well. This technique is 
used by programs such as ABACAS [5], fillScaffolds [6], 
Mauve Aligner [7], OSLay [8], Projector 2 [9], r2cat [10], 
PGA4genomics [11], and CONTIGuator [12]. 

When using a reference genome B to create a scaffold 
for A, one possible problem is the existence of rear- 
rangements in A with respect to B. In prokaryotes, the 
most common rearrangement is the inversion [13], lead- 
ing to the 'X' patterns that are frequently seen in dot- 
plot graphs representing whole genome alignments of 
prokaryotic genomes [14] (when inversions are symmetric 
with respect to the origin of replication). 

In this work we present an algorithm for obtaining con- 
tig scaffolds that explicitly takes into consideration the 
presence of inversions in A with respect to B. We do this 
by searching for inversion signatures in the input con- 
tigs. To our knowledge the only existing program that 
deals directly with inversions is fillScaffolds [6] (FillScaf- 
folds deals with several other kinds of rearrangements 
besides inversions.). One other program seems to deal 
at least indirectly with rearrangements: Mauve Aligner 
[7], which is based on the multiple genome alignment 
program Mauve [15] and on GRIL [16]. 

We created program SIS as an implementation of our 
algorithm, and tested it on real draft genomes that have 
been completed, comparing its performance to seven 
other programs. The criterion used to select genome 
sequences for testing was solely the availability of each 
genome in two formats: incomplete (several contigs) and 
complete (one contig per replicon). In these tests SIS had 
the best overall performance. 

Methods 

Definitions 

A replicon is a self-replication unit of a genome, such as a 
chromosome or a plasmid. 



We represent a pair of single-replicon genomes by 
signed permutations, where numbers represent conserved 
genes or blocks between the two genomes, and the signs 
represent the plus or minus strand. We represent the 
reference genome by the identity permutation i n = 
[ +1, +2, +3, . . . , +«], and the query genome by the per- 
mutation 7i =[ 7Ti, 7T2, 7T3, . . . , Tt n ], where 1 < \7ti\ < n and 

\7ti\ ^ \7Tj\ Hi £j. 

An inversion p(hj) is a rearrangement event 
that reverses the order and the signs of a 
consecutive section of a genome: Ttp(i,j) = 

[ TTi, . . . , 7T/-1, —7tj, . . . , — Tti, Ttj+l, . . . , 7T n ], SUCh that 

1 < i < j < n< Two consecutive conserved blocks 
(7Tj, 7Tj + i) are a breakpoint if Tti 7^ TTj+i — h otherwise the 
consecutive blocks are an adjacency, A strip is a maximal 
substring [ 717, 7T;+i, . . . , Try] such that every pair {nj <} 7i>+i) 
is an adjacency, for i < k < y. We say that an inversion 
p(r, s) acts on a strip [ 777, 7T;+i, . . . , tcj] if / < r < s < j. 

An inversion signature (IS) is a breakpoint (jiu ?r/+i) 
such that Tti and 7T/ + i have different signs. Two ISs 
(jti, tt/+i) and (jtj, 7T/+i) are an IS pair if Tti = —Ttj — 1 and 
7T/ + i = — 7T/ + i + 1 (see Figure 1); our concept of inver- 
sion signature is unrelated to a concept of similar name 
presented in [17]. 

An inversion p (/,;) is symmetric if n = i+j— 1 (Figure 2). 
An inversion p(h,ji) is nested with respect to an inver- 
sion p(hi)2) if h < h < h < h (Figure 3). An inversion 
p(hj) is safe with respect to Tt if p(i,j) acts on a strip 
[ 7T r , 7r r +i, . . . , Tt s ] of Tt (Figure 4). An inversion p(i,j) with- 
out any restrictions on the values of i and j is a generic 
inversion (Figure 5). 

Any sequence of symmetric inversions can be sorted so 
that the result is a series of nested inversions, with the 
same effect on a genome. Every series of nested inversions 
is a series of safe inversions, but the converse is not true. 

Let P\(n), Pi(n), P<$(ri) and P^(n) denote the sets of all 
permutations that can be obtained from i n by applying a 
series of symmetric inversions, a series of nested inver- 
sions, a series of safe inversions, and a series of generic 
inversions, respectively. Clearly, Pi (n) C P2M C P%(ri) C 
P^n). 

A collection of m contigs from a genome Tt is rep- 
resented by a collection of substrings Ci, C2, . . C m , 
such that, for 1 < k < m, Q- =[Tti k > • • • ,7tj k ] or = 
[ — Ttj k , . . . , — Tti k ], and the intervals in Tt that correspond to 
any two contigs C/ q and Cj <2 cannot have an overlap, that 
is [ iki-)h} H[ ik2»jk 2 ] = ®> for 1 < k i < h < m. 

Algorithm 

The algorithm we have developed generates correct scaf- 
folds in the presence of symmetric, nested, or safe inver- 
sions (as long as at least one IS is present for every IS pair); 
in addition it can also produce scaffolds, not necessarily 
100% correct, for generic inversions. 
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Figure 1 Inversion Signatures. The figure shows two regions with inversion signatures. The region o\ -aj in the draft genome matches regions b\ 
and b3 in the reference genome. b\ and 63 are not consecutive in the reference genome and hence this defines a breakpoint; moreover, because 
the match between ay and by has a diffferent orientation as compared to the match between a 2 and 63 this breakpoint is an inversion signature. 
The same happens with region 03-04 of the draft genome and regions 62 and 64 of the reference genome. Both ISs are related because they are the 
result of the same inversion event. Therefore these two ISs constitute an IS pair. 



The algorithm depends on the following assumptions 
in order to generate one single correct scaffold: (1) There 
are no duplicated conserved blocks in either of the two 
genomes; (2) all contigs are correctly assembled; (3) 
both genomes contain only one chromosome; and (4) 
conserved block 01 is in the first position (with + sign) or 
in the last position (with - sign), in some contig. 

Assumption (1) means that there is no conserved block 
that contains a sequence that is a repeat in any of the two 



genomes. This is somewhat unrealistic, since all genomes 
in different degrees have repeats. In this sense, assumption 
(2) also depends on the lack of repeats. But as the tests 
below show, these assumptions do not seem to be major 
obstacles. In particular, repeats account for no more than 
13% of errors in contig adjacencies in our tests. We require 
assumption (3) primarily for simplicity of exposition; in 
a real setting it is generally possible to decompose scaf- 
fold creation for separate replicons as separate problems. 



1 2 3 4 5 6 7 8 



N | / P(3,6) 
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Figure 2 Symmetric Inversions. Three symmetric inversions p(3, 6), 
p(4, 5) and p(2, 7) applied to the identity permutation iq. The final 
result does not depend on the order of the inversions. The colored 
circles represent IS pairs created by the inversions. 
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Figure 3 Nested Inversions. Three nested inversions 7), p(3,6) 
and p(5, 5) applied to the identity permutation l%. The final result 
may change if the order of the inversions changes. The colored circles 
represent IS pairs created by the inversions. 
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Figure 4 Safe Inversions. Four safe inversions p(8, 8), p(1 , 2), p(4, 6) 
and p(5, 5) applied to the identity permutation i$. The colored circles 
represent IS pairs created by the inversions. 



Assumption (4) helps simplify the algorithm description. 
In case block 01 does not follow assumption (4), a remap- 
ping of conserved blocks can be made in time 0(n) such 
that the result does follow assumption (4). 

The algorithm can detect lack of information, caused by 
absence of an IS pair in the input or by the overwriting 
of an earlier (in evolution) IS by a later unsafe inversion, 
and still generate a scaffold. In such cases the algorithm 
chooses a contig that has not been positioned yet, and 
begins generation of a new scaffold (so the result will be 
a set of scaffolds). When IS pairs are missing there are no 
guarantees that the scaffolds generated are correct. 

The input for the algorithm is a list of contigs, each 
contig represented by the conserved blocks it contains. 
Note that we deal only with conserved blocks; insertions 
and deletions of the query genome with respect to the 
reference genome are not considered. 

We now give a detailed example to illustrate the various 
concepts as well as to present the algorithm. In this exam- 
ple, there are 40 conserved blocks distributed in 9 contigs 
as follows: 



Contig 1 
Contig 2 
Contig 3 
Contig 4 
Contig 5 
Contig 6 
Contig 7: 
Contig 8 
Contig 9 



[ -23, -22, +19, +20] 
[-15, -14, +12, +13, -11] 
[+34, +35] 

[ +36, +37, +38, +39, +40] 

[ -03, -02, -01] 

[ +05, +06, +07, +08] 

[+31, +32, +33,-30,-29] 

[ -28, -27, +26, -25, -24, +04] 

[ +21, -18, -17, -16, +09, +10] 



The correct order for the blocks in the query genome is 
dven as follows: 



7T =[ +01, +02, +03, -23, -22, +19, +20, +21, 
- 18, -17, -16, +09, +10, +11, -13, -12, 
+ 14, +15, -08, -07, -06, -05, -04, +24, 
+ 25, -26, +27, +28, -35, -34, +31, +32, 
+ 33, -30, -29, +36, +37, +38, +39, +40] 



Recall that the reference genome is the identity permu- 
tation. The IS pairs are the following: 



IS Pair 1 


[+03,-23] 


X 


[-04, +24] 


IS Pair 2 


[-22, +19] 


X 


[+21,-18] 


IS Pair 3 


[-16, +09] 


X 


[+15, -08] 


IS Pair 4 


[+11,-13] 


X 


[-12, +14] 


IS Pair 5 


[ +25, -26] 


X 


[ -26, +27] 


IS Pair 6 


[+28,-35] 


X 


[ -29, +36] 


IS Pair 7 


[-34, +31] 


X 


[+33,-30] 



An inspection reveals that the query genome has 7 safe 
inversions with respect to the reference genome. However, 



^P(l,5) 
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O 
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3 4 
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~J7(7,7) 



O 



2^-4 -3_ 5 ^ -1 _ 6 -7 ^ 8 
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Figure 5 Generic Inversions. Five generic inversions p(1,5), p(1,4), 
p(2, 3), p(7, 7) and p(3, 7) applied to the identity permutation t 8 . The 
white circles represent ISs overwritten by nonsafe inversions. 
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note that only the following ISs are present in the 
contigs: 



IS Pair 1 
IS Pair 2 
IS Pair 3 
IS Pair 4: 
IS Pair 5 
IS Pair 7 



[ -22, +19] 
[-16, +09] 
[+11,-13] 
[+25,-26] 



x 
x 

X 



[ -04, +24] 
[+21,-18] 

[-12, +14] 
[-26, +27] 
[ +33, -30] 



Note that IS pair 6 is missing from the input contigs. 

The algorithm builds the scaffold incrementally, choos- 
ing one contig following certain rules and placing it in the 
scaffold immediately after the previously chosen contig. 
The first contig chosen is the one that contains block +01 
(or -01). 

Execution of the algorithm on the example 

In the example, block 01 is found inverted in the last posi- 
tion of contig 5 (see Figure 6). Thus, the algorithm reverse 
complements contig 5 and inserts it as the first component 
of the scaffold. The next conserved block searched is +04, 
which should be in the first position of its contig on the 
plus strand or in the last position of its contig on the minus 
strand. Block 04 is found in the last position of contig 8 on 
the minus strand, which is interpreted by the algorithm to 
be the result of a safe inversion. The algorithm then uses 
the IS (-24,+04) to search for block -23, which should be 
at the other end of the inversion that created the IS found. 
Block —23 is found in the first position of contig 1, so this 
contig becomes the second in the scaffold. 

The next conserved block to be searched is +21. It is 
found at the beginning of contig 9, on the plus strand 
as expected, so this contig is placed in the third scaffold 
position. The next block searched is +11, which is found 
inverted in the last position of contig 2. This means that 



contig 2 will be inserted in the fourth scaffold position 
after it is reverse-complemented. The next block to be 
searched is +16, which is found on the minus strand in the 
middle of the already placed contig 9. The algorithm then 
uses the IS (—16, +09) to determine that the next block to 
be searched is —08. Block —08 is found on the plus strand 
in the last position of contig 6. The subsequent steps 
proceed along the same lines. The resconstructed scaf- 
fold is [ -C 5 , -Q, +C 9 , -C 2 , -C 6 , -C 8 , -C 7 , +C 3 , +C 4 ]. 
This scaffold contains errors, because the 
correct scaffold given by permutation n is 
[ -C 5 , -Q, +C 9 , -C 2 , -C 6 , -C 8 , -C 3 , +C 7 , +C 4 ]. This 
was caused by the fact that from the given input it is 
impossible to tell that contig 3 should remain as it is 
and that contig 7 should be reverse complemented. This 
stems from the fact that the endpoints of IS pair 6 are not 
to be found in the input contigs. But the algorithm still 
managed to get 7 correct adjacencies out of 9. 

Pseudocode for the algorithm is given in Figure 7. 
The algorithm assumes that the following data struc- 
tures are available and filled in: contigs stores 
contig information and inContig contains con- 
served block information, storing for each conserved 
block the position in structure contigs where it 
can be found, as well as the contig number and 
the sign (strand). More formally, if contigs [c] 

[p] = x, then inContig [ | x | ] .contig = c, 
inContig [ | x | ] .pos = p and inContig [ | x | ] . 
sign = s, such that s x \x\ = x. For example, 
contigs [7] =[ +31, +32, +33, -30, -29], contigs 

[7] [4] = -30, inContig [ | -30 | ] .contig = 
7, inContig [ | -3 0 | ] .pos = 4 and inContig 

[ | -30 | ] . sign = -1. 

In addition to the data structures described above 
the algorithm uses two auxiliary vectors: used indi- 
cates which contigs have been placed in the scaf- 
fold; and searched records conserved blocks that 
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Figure 6 Scaffold Generated in the Presence of Safe Inversions. Illustration of the example discussed in the text of a genome with 40 conserved 
blocks with respect to a reference genome. The query genome and the reference genome differ by seven safe inversions, denoted by the nested 
horizontal square brackets. The conserved blocks appear in nine contigs. The permutation on the top is the "real" query genome, and the 
permutation at the bottom is the scaffold reconstructed by the algorithm. The bars with alternating colors help visualize inversion signatures. 
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Algorithm 1: Scaffolding Algorithm 



1 
2 
3 
4 
5 
6 

7 

8 



10 
11 
12 
13 
14 
15 

16 
17 
18 
19 
20 



21 
22 
23 
24 

25 



26 
27 
28 

29 
30 
31 
32 
33 
34 

35 
36 
37 
38 
39 



Data: contigs, inContig, m, n 
Result: scaffold 

// Initialize temporary variables 
for i 4— 1 to m do used[i] 4— false 

for i < n to +n do searched [i] <(— false 

searched [n+1] 4— true 
nextC 4- 1 
next <— 1 
i <- 0 

// While not all contigs positioned 
while i < m do 
ok ^— false 

// If searching for a valid element 
if next / n + 1 then 

// Search for the element in the contigs 

e ^— inContig [ | next | ] 
if sign(next) = e.sign then 
if e.pos / 1 then 

next 4— contigs [e.contig] [e.pos— 1] + 1 
else 

ok «— true 

else 

if e.pos 7^ len( contigs [e.contig]) then 

next 4 contigs [e.contig] [e.pos+1] + 1 

else 

ok 4— true 



// If searching a new element 

if not (searched [next]) then 

searched [next] <(— true 
else 

ok <(— true 

// If OK to place a contig in the scaffold 
if ok then 

//If the contig has already been 
positioned in the scaffold 

if next = n + l or used [e.contig] then 
next 4— contigs [next C][l] 
e inContig [ | next | ] 

// Place found contig in the scaffold 

contigSign ^— sign (next) x e.sign 
scaffold [i] <r- contigSign x e.contig 
i <- i + 1 

used [e.contig] <(— true 
while nextC < m and used[nextC] do 
nextC <(— nextC + 1 

// Determine next element to search 
if sign(next) = e.sign then 

lastlndex ^— len( contigs [e.contig]) 
next contigs [e.contig] [lastlndex] + 1 
else 

next 4 contigs [e.contig] [1] + 1 



40 return scaffold 



Figure 7 Algorithm to determine a genome scaffold based on inversion signatures. Data structures used are explained in the text. 



Dias et al. BMC Bioinformatics 201 2, 1 3:96 
http://www.biomedcentral.eom/1 471 -21 05/1 3/96 



Page 7 of 12 



have been searched so far. Variable nextC indi- 
cates the contig with smallest ID not yet placed in 
the scaffold. 

The main loop in lines 7-39 is executed until all con- 
tigs have been placed in the scaffold. Variable ok indicates 
whether the algorithm found a candidate contig to be 
placed in the scaffold. Line 9 ensures that the algorithm 
is not searching for an invalid conserved block. Lines 21- 
24 check whether the algorithm is searching an already 
searched conserved block. 

If the algorithm decides to insert a new contig in the 
scaffold (test in line 25) lines 26-39 are executed. If no 
valid contig was found, the first contig not yet placed is 
determined in lines 26-28. Lines 29-30 place that contig 
as the first in a new scaffold. Vector used and variables 
nextC and i are updated in lines 31-34. 

Recalling that n is the number of conserved blocks and 
m is the number of contigs, with n > rn, we now state 
the algorithms complexity. Data structure inCont ig can 
be built in O(n). Initialization takes 0(n + m). The main 
loop is executed n + m times, since in each iteration 
either a new position in vector searched is marked or 
a new contig is placed in the scaffold. There is one oper- 
ation inside the main loop that does not take constant 
time, which is the loop in lines 33-34. However, variable 
nextC can be incremented at most m times, so the amor- 
tized cost of each update is 0(1). This leads to an overall 
complexity of 0{n). 

Testing Methodology 

We have implemented the algorithm creating program SIS 
(Scaffolds from Inversion Signatures). We stress that SIS 
does not require the algorithm assumptions listed above 
in order to be used. We used real contigs available in 
GenBank for testing as described below. 

The main quality measure for a scaffold is the number of 
correct contig adjacencies. A contig adjacency for contigs 
x and y is correct if they are consecutive in the (completed) 



query genome as well as in the generated scaffold set. In 
the case of circular replicons, all correct scaffolds with m 
contigs have exactly m correct adjacencies. Even though 
SIS in general outputs a set of scaffolds, for the purposes 
of this test we consider the result to be just one scaffold, 
with scaffolds being placed side by side as the algorithm 
creates them. 

A second quality measure is genome coverage. In this 
measure we attempt to determine how much of the 
genome to which the contigs belong is actually covered 
by the scaffolds generated. The working definition we use 
is as follows. If both ends of a contig have correct adja- 
cencies, then we count the entire length of that contig as 
contributing to the coverage. If only one end has a cor- 
rect adjacency, then half of the contig length is used. If 
both ends have incorrect adjacencies, then the contig is 
not used. We define coverage as the ratio of the sum of 
contig lengths considered per the above rules and the sum 
of all contig lengths. 

In this test we compared SIS to seven other scaffold gen- 
erating programs, namely ABACAS [5], fillScaffolds [6], 
Mauve Aligner [7], OSLay [8], Projector 2 [9], r2cat [10], 
and CONTIGuator [12]. Table 1 gives a summary of all 
programs tested. 

It is important to mention that among programs tested 
fillScaffolds [6] was the only one designed specifically for 
eukaryote sequences, whereas all of our tests are done 
on prokaryote sequences. FillScaffolds is a sophisticated 
tool that takes into account many kinds of rearrangements 
in addition to inversions, such as transpositions, recipro- 
cal translocations, and chromosome fusions and fissions. 
We have included fillScaffolds in the present study for 
completeness. 

One important consideration is block conservation 
detection. In the case of SIS this was done by programs 
nucmer [18] and promer [18], with default settings. We 
postprocessed the outputs using the MUMmer script 
delta-filter [18] with parameter —1, which ensures that 
only nonrepeated blocks are processed. 



Table 1 Programs Tested 





Prokaryote vs. 


Unichromosomal vs. 


Standalone 


Block 


User 




Eukaryote 


Multichromosomal 


vs. Server 


Detection 


Interface 


Abacas 


Prokaryote 


Multichromosomal 


Standalone 


nucmer, promer 


Command line 


CONTIGuator 


Prokaryote 


Multichromosomal 


Standalone 


Blast+ 


Command line 


fillScaffolds 


Eukaryote 


Multichromosomal 


Standalone 


nucmer, promer 


Command line 


Mauve Aligner 


Prokaryote 


Unichromosomal 


Standalone 


Mauve 


GUI 


OSLay 


Both 


Unichromosomal 


Standalone 


nucmer 


GUI 


Projector2 


Prokaryote 


Unichromosomal 


Server 


BLAT 


Web 


r2cat 


Both 


Multichromosomal 


Standalone 




GUI 


SIS 


Prokaryote 


Unichromosomal 


Both 


nucmer, promer 


Web, Command line 
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Mauve Aligner uses progressiveMauve [15] for sequence 
comparison. OSLay can use nucmer and that is what we 
chose (with default settings). r2cat does sequence com- 
parison internally. Projector 2 can use BLAT [19] (default) 
or BLAST [20], and we chose BLAT. CONTIGuator uses 
BLAST+ [21]. ABACAS can use nucmer or promer. We 
chose nucmer. (We observed that choosing promer made 
ABACAS run for more than two days without provid- 
ing any scaffold.) For fillScaffolds [6] we have used both 
nucmer and promer. 

Input genomes 

We have tested the programs on real contigs from pre- 
liminary assemblies of 19 genomes. These genomes have 
been finished, so that we know the correct order for 
contigs in all cases. The list of genomes used is given 
in Table 2. In the table, column cov contains the frac- 
tion of each chromosome covered by the contigs selected. 
Column 'tech.' describes de sequencing technology used, 
when this information was available in the correspond- 
ing GenBank file. Four of the genomes in Table 2 have 
two chromosomes; each chromosome was dealt with sep- 
arately, thus creating a total of 23 test cases. The genomes 



chosen were those available from GenBank b on December 
7, 2011, for which complete genomes were also available 
(with at least 4 contigs for each genome). These genomes 
are quite diverse phylogenetically, representing seven dif- 
ferent bacterial groups (where "group" comes from the 
GenBank table of completed genomes and generally cor- 
responds to taxonomic phylum or class) and one archaeal 
group. No other filter was applied in this selection. The 
sequencing technologies represented in this table include 
at least one representative of a genome sequenced with 
454, Solexa/Illumina, and/or ABI SOLiD. The genome 
sequences we used in our tests (draft and complete) are 
available at the website http://marte.ic.unicamp.br:8747. 

For each of the query chromosomes we obtained a list 
of the 20 closest genomes (excluding the query genome 
itself), among 1331 complete prokaryotic genomes avail- 
able at GenBank on May 5, 2011. We used NUCMi [22] 
(a variation of MUMi [23]) to compute the distance 
between each of the 1331 genomes and the query genome. 
Our rationale for selecting 20 closest other genomes 
instead of only the closest was as follows. In practice, the 
choice of a reference genome should be guided by phylo- 
genetic distance. The closest genome to the draft genome 



Table 2 Chromosome Sequences Used in Tests 



Genome 


Chromosome 


Size (bp) 


Contigs 


cov.(%) 


tech. 


Aciduliprofundum boonei T469 


NC_013926 


1486778 


29 


89.36 




Bacillus subtilis 1 68 


NCD00964 


4215606 


5 


99.98 


Solexa 


Bifidobacterium long urn DJ01 OA 


NCD10816 


2375792 


43 


60.02 




Brucella melitensis bv 1 1 6M 


NC.003317 


2117144 


41 


91.06 


454 


Brucella melitensis bv 1 1 6M 


NC.003318 


1177787 


12 


99.94 


454 


Brucella pinnipedialis B2 94 


NCD15857 


2138342 


55 


87.68 


454 


Brucella pinnipedialis B2 94 


NCD15858 


1260926 


34 


84.58 


454 


Burkholderia thailandensis E264 


NC.007650 


2914771 


15 


85.33 




Burkholderia thailandensis E264 


NC.007651 


3809201 


26 


75.90 




Chlamydia muridarum Nigg 


NCD02620 


1072950 


4 


99.09 


454 


Clostridium cellulovorans 743 B 


NCD14393 


5262222 


293 


94.16 


454/lllumina 


Corynebacterium aurimucosum ATCC 700975 


NC.012590 


2790189 


88 


85.71 


SOLiD 


Corynebacterium efficiens YS 3 1 4 


NC.004369 


3147090 


118 


95.90 


454 


Micrococcus luteus NCTC 2665 


NCD12803 


2501097 


121 


76.66 




Mycobacterium tuberculosis H37Ra 


NCD09525 


4419977 


220 


83.75 




Mycoplasma genitalium G37 


NC_000908 


580076 


24 


78.59 


454 


Saccharopolyspora erythraea NRRL 2338 


NCD09142 


8212805 


237 


96.43 


454 


Selenomonas sputigena ATCC 351 85 


NCD15437 


2568361 


49 


97.16 


454 


Stigmatella aurantiaca DW4 3 1 


NC_0 14623 


10260756 


466 


97.62 




Streptococcus pneumoniae TIGR4 


NC.003028 


2160842 


211 


90.98 


454 


Vibrio Ex25 


NC.013456 


3259580 


176 


91.77 




Vibrio Ex25 


NCD13457 


1829445 


33 


95.31 




Yersinia pestis Nepal5 1 6 


NCD08149 


4534590 


17 


84.21 


SOLiD 
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Table 3 Average Performance (Correct Adjacencies) 


Programs 


Closest 


Top 10 


Top 20 


sis (promer) 


60.96 


50.01 


37.27 


r2cat 


59.72 


44.19 


30.30 


Mauve 


58.65 


46.31 


32.17 


sis (nucmer) 


56.39 


43.95 


28.55 


fill Scaffolds (nucmer) 


48.47 


34.10 


21.32 


OSLay 


46.83 


34.03 


21.28 


fill Scaffolds (promer) 


44.19 


32.94 


21.63 


CONTIGuator 


42.02 


30.35 


20.01 


Projector 2 


36.89 


25.35 


15.51 


ABACAS 


33.80 


23.95 


13.20 



is the one most likely to yield best results. On the other 
hand, given the extremely small sample of the prokary- 
otic world that has been sequenced to date, it is perfectly 
possible that new strains will be sequenced in the future 
for which the closest reference genome will not be par- 
ticularly close. Therefore it is important to understand 
how the performance of a scaffolding program changes 
with increased distance between the query and possible 
reference genomes. 

Results and discussion 

We ran each scaffold program on each of the 23 query 
chromosomes using as reference each of the 20 closest 
chromosomes to the query chromosome. This resulted in 
460 scaffold sets for each program. For a given program 
and a given set of reference chromosomes, we computed 
average results over the 23 query chromosomes. 

Table 3 shows average results for three cases: when 
the reference chromosome used was the closest available, 
then the average (mean) over the 10 closest chromosomes, 



then the average (mean) over all 20 closest chromosomes. 
Performance of each program on each test set given by 
each column in the table. The numbers are percentages 
of correct adjacencies, averaged over all instances in each 
column. In bold are the best results. The table is sorted 
from best to worst in the first column. In all cases the best 
performance was that of SIS (promer). If the median is 
used rather than the mean, then SIS (nucmer) is the best 
program for the closest case, Mauve is the best for the 10- 
closest case, and SIS (promer) is the best for the 20-closest 
case (see Additional file 1). 

In order to evaluate the performance of the programs 
tested with increased distance from the reference chro- 
mosome we created the graphs shown in Figures 8 and 9. 
These graphs show that the performance of all pro- 
grams degrade with increased distance, but again here SIS 
(promer) stays above the others in all cases 

The results presented so far are averages. In Table 4 we 
present results concerning the fraction of the instances 
in which each program was the winner (provider of the 
best scaffold). In the table columns do not add up to 100% 
because of ties. Best results are shown in bold. The table 
is sorted from best to worst in the first column. Using this 
measure SIS (promer) superiority over the other programs 
is even more pronounced. 

Tables 5 and 6 are analogous to Tables 3 and 4, respec- 
tively, but show the results in terms of coverage (see Test- 
ing Methodology). Also under this measure SIS (promer) 
was the best program. 

It is important to note that the pairs of query-reference 
genomes tested were not selected for containing inver- 
sions. So any given pair query-reference could have any 
kind of rearrangement. The success of SIS in the cases 
tested suggests that inversions are indeed widespread 
in prokaryotic genomes (to the extent that genomes in 
Table 2 are a representative sample). 




SIS (promer) 

— SIS (nucmer) 
Mauve 
r2cat 

— OSLay 

fS (nucmer) 
fS (promer) 

— CONTIGuator 

— Projector2 
_ ABACAS 



6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
Closest Reference Genomes 



Figure 8 Average of Correct Adjacencies. Variation in number of correct adjacencies for each program as the reference genome changes from 
the closest to the most distant, in a total of 20 reference genomes, f S is fillScaffolds. 
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SIS (promer) 

— SIS (nucmer) 
Mauve 
r2cat 

— OSLay 

fS (nucmer) 
fS (promer) 

— CONTIGuator 

— Projector2 
ABACAS 



6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
Closest Reference Genomes 



Figure 9 Average of Correct Adjacencies (Cumulative Version). Variation in number of correct adjacencies for each program as the reference 
genome changes from the closest to the most distant, in a total of 20 reference genomes. This is a cumulative version of Figure 8. The points plotted 
correspond to the cumulative averages as more distant genomes are included, f s is fillScaffolds. 



Additional results 

Here we briefly describe additional results for which tables 
and graphs can be found in the Additional file 1. 

Another way of parsing results is to ask the question: 
which program yields best results for each query chro- 
mosome sequence, in terms of number of average correct 
adjacencies? SIS (promer and nucmer) are the best pro- 
grams under this measure. 

Another question to ask is what kind of influence 
does the number of contigs has on a programs perfor- 
mance? We have found that all programs obtain their best 
results for chromosome sequences that have few contigs 
(between 4 and 25). As for worst results, the outcome was 
less clear, since it is not always the case that the largest 
number of contigs results in poorest performance. We 
also determined that fillScaffolds and r2cat seem to be less 
sensitive to variation in number of contigs in the input 
than other programs. These results need to be taken with 
caution, because the bins we used had a relatively small 
number of instances in each (between five and seven). 



Finally, we investigated the influence of duplications on 
the performance of SIS in terms of correct adjacencies. 
We found that on average no more than 13% of incorrect 
adjacencies are due to duplications. 

Running times 

Running times of the tests are dominated by the sequence 
comparison step. For example, in the case of SIS, nucmer 
takes on average 1 minute for a genome pair, and promer 
takes on average 5 minutes. SIS itself takes less than a 
second to compute a scaffold set. OSLay takes about 3 sec- 
onds. After obtaining the conserved blocks, fillScaffolds 
takes about 3 seconds. r2cat takes between 15 and 20 sec- 
onds to determine conserved blocks and about 2 seconds 
to generate a scaffold. ABACAS takes between 10 and 
30 seconds total time. CONTIGuator takes about 2 min- 
utes to determine conserved blocks using BLAST+ and to 
generate a scaffold. Mauve Aligner takes between 8 and 
322 minutes per genome pair (on average, 46 minutes). 
All these tests were executed on the same machine, a 



Table 4 Best Scaffolds (Correct Adjacencies) Table 5 Average Performance (Coverage) 



Programs 


Closest 


Top 10 


Top 20 


Programs 


Closest 


Top 10 


Top 20 


sis (promer) 


56.52 


59.13 


64.57 


sis (promer) 


58.6 


46.0 


34.6 


sis (nucmer) 


39.13 


27.83 


20.87 


r2cat 


56.8 


40.4 


27.5 


Mauve 


26.09 


25.65 


23.26 


Mauve 


55.9 


41.9 


29.5 


OSLay 


21.74 


15.65 


11.30 


sis (nucmer) 


52.4 


39.5 


25.6 


fillScaffolds (nucmer) 


21.74 


11.30 


8.70 


fillScaffolds (nucmer) 


42.9 


29.9 


19.0 


ABACAS 


17.39 


11.74 


7.61 


OSLay 


42.5 


29.0 


18.2 


r2cat 


13.04 


16.09 


15.65 


CONTIGuator 


42.0 


29.3 


19.3 


CONTIGuator 


13.04 


9.57 


6.30 


fillScaffolds (promer) 


40.2 


29.2 


19.5 


Projector 2 


8.70 


6.09 


4.57 


Projector 2 


34.3 


22.5 


13.9 


fillScaffolds (promer) 


0.00 


6.96 


8.04 


ABACAS 


27.8 


19.9 


11.3 
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Table 6 Best Scaffolds (Coverage) 



Programs 


Closest 


Top 10 


Top 20 


sis (promer) 


47.8 


51.7 


56.5 


sis (nucmer) 


39.1 


24.4 


17.0 


Mauve 


34.8 


25.7 


20.2 


OSLay 


17.4 


12.2 


8.9 


fill Scaffolds (nucmer) 


17.4 


11.3 


7.4 


r2cat 


13.0 


17.4 


14.4 


ABACAS 


13.0 


11.7 


7.8 


CONTIGuator 


13.0 


10.0 


6.5 


Projector 2 


8.7 


6.1 


4.6 


fill Scaffolds (promer) 


0.0 


7.0 


7.4 



but this will require a substantial change to the algo- 
rithm presented here, as evidenced by the sophistication 
of the recent algorithm for paired-end read scaffolding 
presented by Gao et al [1]. Another possible improvement 
in the scaffold generating process is to use several refer- 
ence genomes instead of only one, under the rationale that 
some inversions may be missed by using reference genome 
X but may be detected using some other reference genome 
Y. An idea similar to this is described for the program 
PGA4genomics [11]. 

Endnotes 

a www.ncbi.nlm.nih.gov/genomes/lproks.cgi 
b ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT 



standard desktop computer with Intel Core2 Duo proces- 
sor, 3.0 GHz and 4 GB RAM. Projector 2 can only be run 
on a web server. We observed that the total computation 
time on the server was about 1 minute, when the server 
appeared to have a light load. 

Details of these tests and other tests 

In Additional file 1 we present more detailed results of the 
tests above as well as results using pairs of real genomes 
and simulated contigs. SIS also came out as the best pro- 
gram in these other tests. These results are a refinement 
of preliminary tests presented in [24]. 

Availability of SIS 

SIS is available as a web server at http://marte.ic.unicamp. 
br:8747. It can also be freely downloaded as a stand alone 
program from the same website. SIS generates its scaffolds 
both as ordered lists of contigs as well as in nucleotide 
sequence FASTA format. 

Conclusions 

We have presented a new linear-time algorithm for gen- 
erating scaffolds for draft genomes, based on the concept 
of inversion signatures. We implemented this algorithm 
creating program SIS, which is to our knowledge the first 
scaffold program that explicitly models the biological phe- 
nomenon of replicon inversion in prokaryotes. We com- 
pared SIS to seven other programs, and demostrated that 
SIS achieves better performance relative to these other 
programs using a real and diverse suite of test cases. 

In a real world setting, the scaffolds generated by SIS 
can help in the gap closure process. For example, SIS 
output can easily be used as input to primer genera- 
tion programs such as Primer3 [25]. If paired-end read 
data is available, it can be used to check the scaffolds 
provided and to connect separate scaffolds, thus generat- 
ing more complete and reliable contig orderings. Ideally, 
SIS should be able to use paired-end read information, 



Additional file 



Additional file 1 : Figure SI . Variation in the Number of Correct 
Adjacencies (Top 1). Variation in the number of correct adjacencies 
determined by each scaffold program when the reference genome is the 
closest to the query genome. The diamond is the median. Figure S2. 
Variation in the Number of Correct Adjacencies (Top 10). Variation in the 
number of correct adjacencies determined by each scaffold program 
averaged over the 1 0 closest genomes to the query genome. The diamond 
is the median. Figure S3. Variation in the Number of Correct Adjacencies 
(Top 20). Variation in the number of correct adjacencies determined by 
each scaffold program averaged over the 20 closest genomes to the query 
genome. The diamond is the median. Figure S4. Test Cases X Correct 
Adjacencies (Top 1 ). Figure S5. Test Cases X Correct Adjacencies (Top 1 0). 
Figure S6. Test Cases X Correct Adjacencies (Top 20). Figure S7. Correct 
Adjacencies X Number of Contigs (Top 1). Figure S8. Correct Adjacencies 
X Number of Contigs (Top 10). Figure S9. Correct Adjacencies X Number 
of Contigs (Top 20). Figure S10. Example of Dotplot. Pairwise whole 
genome comparison of two Pseudomonas species. The comparison was 
done using nucmer [1 8]. Figure S1 1 . Mycobacterium (All pairs). Variation 
of the distribution of the number of correct adjacencies in the scaffolds 
generated by the various programs for the complete set (210 pairs) of 
Mycobacterium genomes. Figure SI 2. Pseudomonas (All Pairs). Variation of 
the distribution of the number of correct adjacencies in the scaffolds 
generated by the various programs for the complete set (1 53 pairs) of 
Pseudomonadaceae genomes. Figure SI 3. Shewanellas (All Pairs). Variation 
of the distribution of the number of correct adjacencies in the scaffolds 
generated by the various programs for the complete set (1 90 pairs) of 
Shewanella genomes. Figure S"\4. Xanthomonas (All Pairs). Variation of the 
distribution of the number of correct adjacencies in the scaffolds 
generated by the various programs for the complete set (36 pairs) of 
Xanthomonas genomes. Figure SI 5. Mycobacterium (Best Pairs). Variation 
of the distribution of the number of correct adjacencies in the scaffolds 
generated by the various programs for only those pairs of Mycobacterium 
genomes that are closest to each other in the second batch of tests. 
Figure SI 6. Pseudomonas (Best Pairs). Variation of the distribution of the 
number of correct adjacencies in the scaffolds generated by the various 
programs for only those pairs of Pseudomonas genomes that are closest to 
each other in the second batch of tests. Figure SI 7. Shewanellas (Best 
Pairs). Variation of the distribution of the number of correct adjacencies in 
the scaffolds generated by the various programs for only those pairs of 
Shewanella genomes that are closest to each other in the second batch of 
tests. Figure SI 8. Xanthomonas (Best Pairs). Variation of the distribution of 
the number of correct adjacencies in the scaffolds generated by the 
various programs for only those pairs of Xanthomonas genomes that are 
closest to each other in the second batch of tests. Figure SI 9. Average (All 
Pairs). Variation of the distribution of the average number of correct 
adjacencies in the scaffolds generated by the various programs for the 
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complete set of test instances in the second batch. Figure S20. Average 
(Best Pairs). Variation of the distribution of the average number of correct 
adjacencies in the scaffolds generated by the various programs for only 
those pairs of genomes that are closest to each other in the second batch 
of tests. 
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